The universality of iterated hashing over variable-length strings

Daniel Lemire

LICEF, Université du Québec à Montréal (UQAM), 100 Sherbrooke West, Montreal, QC, H2X 3P2 Canada



Abstract 

Iterated hash functions process strings recursively, one character at a time. At each 
iteration, they compute a new hash value from the preceding hash value and the next 
character. We prove that iterated hashing can be pairwise independent, but never 3- 
wise independent. We show that it can be almost universal over strings much longer 
than the number of hash values; we bound the maximal string length given the collision 
probability. 

1. Introduction

We consider hash functions mapping variable-length strings to $L$-bit integers. They have numerous applications from indexing (e.g., with hash tables [3, 9, 20, 33] and Bloom filters [24]) to spell-checking [13], compression [11] and cryptography [26].

We consider hash functions $h$ picked randomly from a family $\mathcal{H}$ [4]. We focus on iterated hash functions [11, 23]: given a string $s_1 s_2 \cdots s_n$, starting from an initial value (or seed) $H_0$, the hash value of the whole string, $H_n$, is computed recursively from a compression function $F$ as $H_i = F(H_{i-1}, s_i)$ for $i = 1, 2, \ldots, n$. Thus, a hash function is defined both by an initial value $H_0$ and a compression function $F$. A typical example is Carter-Wegman Polynomial Hashing over finite fields [4, 12], that is, hash functions of the form $h(s) = \sum_{i=1}^{n} t^{n-i} s_i$ for some randomly chosen element $t$. Many hash functions over variable-length strings are iterated, including Pearson hashing [21, 32], SAX and SXX [25] as well as the hash functions commonly used in C++ and Java. In cryptography, iterated hashing is also known as Merkle-Damgård [16] hashing; it includes the popular functions MD4, MD5, SHA-0 and SHA-1.
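
The recursion $H_i = F(H_{i-1}, s_i)$ can be sketched in a few lines of Python. This is only an illustration: the helper names (`iterated_hash`, `poly_compress`) and the constants `P` and `T` are assumptions made for the example, not part of the paper.

```python
from typing import Callable, Sequence

def iterated_hash(compress: Callable[[int, int], int],
                  initial_value: int,
                  chars: Sequence[int]) -> int:
    """Return H_n, obtained from H_0 by applying H_i = F(H_{i-1}, s_i)."""
    h = initial_value
    for c in chars:
        h = compress(h, c)
    return h

# Hypothetical parameters for the example: a small prime P and a multiplier T.
P, T = 101, 7

def poly_compress(y: int, c: int) -> int:
    """Polynomial-style compression function F(y, c) = T*y + c mod P."""
    return (T * y + c) % P

value = iterated_hash(poly_compress, 0, [1, 2, 3])
```

With an initial value of zero, `value` equals $1 \cdot 7^2 + 2 \cdot 7 + 3$ modulo 101, which matches the polynomial form given above.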

Good hash functions are such that hash values appear random. Formally, a family is (pairwise) universal if the probability of a collision is no larger than if the hash values were random: $P(h(s) = h(s')) \leq 1/2^L$ for $s \neq s'$. It is $\epsilon$-almost universal [31] (or $\epsilon$-AU) if the probability of a collision is bounded by $\epsilon < 1$. Furthermore, a family is $k$-wise independent if given $k$ distinct elements $s^{(1)}, s^{(2)}, \ldots, s^{(k)}$, their hash values are independent:

$$P\left(h(s^{(1)}) = y^{(1)} \land h(s^{(2)}) = y^{(2)} \land \cdots \land h(s^{(k)}) = y^{(k)}\right) = \frac{1}{2^{kL}}$$

for any hash values $y^{(1)}, y^{(2)}, \ldots, y^{(k)}$. Pairwise (or 2-wise) independence implies pairwise universality and thus, it is also called strong universality.

Table 1: Upper bounds on the string length for arbitrarily large iterated hash families. We write $\mathrm{LCM}_{2^L}$ for the least common multiple of the integers from 1 to $2^L$.

The main contributions are as follows. 

• We show that iterated hashing cannot be 3-wise independent (§3). Thus, we have to be satisfied with pairwise independence. We can get 3-wise independence if we consider a generalization of iterated hashing where there is a new compression function with each iteration in the computation. However, 4-wise independence remains impossible.

• We show that pairwise independence is possible for iterated hashing by presenting the TABULATED family (§5.2).

• We show that almost universality is possible for strings longer than $2^L$ characters, e.g., with the Pearson family (§5.4).

• Iterated hashing families have limited cardinality: there are only so many possible compression functions. This limits their universality. We apply results from Nguyen and Roscoe [18] and Stinson [31] to derive new bounds (§6). To make one such bound tighter, we use the fact that pairwise independent families must have permuting compression functions, a concept we introduce in §4.

• We can derive tighter bounds using the innate limitations of iterated hashing (§7).

Table 1 summarizes some of our results.



2. Preliminaries 

We want $L$-bit integer hash values: hash functions map elements to integers in $\{0, 1, \ldots, 2^L - 1\}$. The weakest property we require from a family is uniformity: all hash values are equiprobable. That is, a family is uniform if $P(h(s) = y) = 1/2^L$ for any $s$ and $y$. (We pick $h$ uniformly at random from the family $\mathcal{H}$.) It is not difficult to construct such a family. For example, let $\mathcal{H}$ be the set of all $2^L$ distinct constant functions ($h(s) = c$ for all $s$). This family is uniform but a poor choice in practice because any two elements $s$ and $s'$ are sure to collide: $P(h(s) = h(s')) = 1$.
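
The constant family can be checked by direct enumeration. The sketch below is illustrative only: the helper `probability`, the value $L = 3$ and the sample strings are assumptions chosen for the example.

```python
from fractions import Fraction

def probability(family, event):
    """Probability of event(h) when h is drawn uniformly at random from family."""
    hits = sum(1 for h in family if event(h))
    return Fraction(hits, len(family))

L = 3
# The 2^L constant functions h(s) = c; the default argument binds each c.
constant_family = [(lambda s, c=c: c) for c in range(2 ** L)]

# Uniformity: a fixed string takes a fixed hash value with probability 1/2^L.
uniform = probability(constant_family, lambda h: h("abc") == 5)
# Collisions: two distinct strings always collide.
collide = probability(constant_family, lambda h: h("abc") == h("xyz"))
```

Here `uniform` is $1/2^L$ while `collide` is 1, which is why uniformity alone is too weak a requirement.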

Thus, we commonly seek families satisfying stronger conditions. A family is $\epsilon$-almost universal if the probability of a collision is bounded by $\epsilon < 1$: $P(h(s) = h(s')) \leq \epsilon$ for any two distinct elements $s$ and $s'$. We say that the family is universal when it is $1/2^L$-almost universal.

While bounding the probability of a collision is sufficient for some applications like conventional hash tables, other results require stronger properties. A family is pairwise independent (or strongly universal) if the hash values of any two elements are independent:

$$P(h(s) = y \land h(s') = y') = \frac{1}{2^{2L}}$$

for any two distinct elements $s, s'$ and any two hash values $y, y'$. It is 3-wise independent if the hash values of any three elements are independent:

$$P(h(s) = y \land h(s') = y' \land h(s'') = y'') = \frac{1}{2^{3L}}$$

for any three distinct elements $s, s', s''$ and any three hash values $y, y', y''$. We can generalize this definition to $k$-wise independence. A family which is $k$-wise independent for any $k$ is fully independent. (A family is trivially $k$-wise independent over a set containing fewer than $k$ distinct elements. In our work, we implicitly assume that there are at least $k$ distinct elements whenever we consider $k$-wise independence.)

We have that $k$-wise independence implies $(k-1)$-wise independence for $k > 2$. For example, suppose we have 3-wise independence. Then we have that

$$P(h(s) = y \land h(s') = y') = \sum_{y''=0}^{2^L - 1} P(h(s) = y \land h(s') = y' \land h(s'') = y'') = 2^L \times \frac{1}{2^{3L}} = \frac{1}{2^{2L}}.$$

Similarly, $k$-wise independence for any $k > 1$ implies uniformity.

A family that is universal, but not strongly universal, might be XOR universal if the bitwise exclusive OR of hash values appears random, that is $P(h(s) \oplus h(s') = y) = 1/2^L$ for all distinct elements $s, s'$ and hash values $y$ [11, 18]. (The symbol $\oplus$ is the bitwise exclusive OR.)

We can weaken both strong universality and XOR universality:

• Given $\epsilon < 1$, a family is $\epsilon$-almost strongly universal if it is uniform and if $P(h(s) = y \land h(s') = y') \leq \epsilon/2^L$ for any two distinct elements $s, s'$ and any two hash values $y, y'$. A $1/2^L$-almost strongly universal family is strongly universal.

• Similarly, it is $\epsilon$-almost XOR universal (or $\epsilon$-AXU) if the probability $P(h(s) \oplus h(s') = y)$ is bounded by $\epsilon$.

We have that $\epsilon$-almost strong universality implies $\epsilon$-almost XOR universality which itself implies $\epsilon$-almost universality. (See Fig. 1.)



Figure 1: Visual summary of the main properties related to universality. Strong universality is synonymous with pairwise independence.

We can also generalize almost universality to $k$-wise almost universality: a family is $k$-wise $\epsilon$-almost universal [18] if the probability of a $k$-way collision is bounded by $\epsilon$ for some $\epsilon < 1$. That is, if we have $P(h(s^{(1)}) = h(s^{(2)}) = \cdots = h(s^{(k)})) \leq \epsilon$ as long as the $k$ elements $s^{(1)}, s^{(2)}, \ldots, s^{(k)}$ are distinct. We have that $k$-wise $\epsilon$-almost universality implies $(k+1)$-wise $\epsilon$-almost universality. E.g., almost universality implies 3-wise almost universality.

3. Iterated hashing is pairwise independent at best 

We write the concatenation $ab$ of two strings $a$ and $b$ as $a \| b = ab$. If $\emptyset$ is the empty string, then $\emptyset \| a = a$. We begin by characterizing iterated functions.

Proposition 1 Consider hash functions over all strings, including the empty string. The following statements are equivalent:

• $\mathcal{H}$ is a family of iterated hash functions;

• For any $h \in \mathcal{H}$, whenever $h(s) = h(s')$ for a pair of strings $s, s'$, then $h(s \| s'') = h(s' \| s'')$ for any string $s''$.

Proof. By induction, iterated hash functions satisfy the second point. Indeed, suppose that $h \in \mathcal{H}$ is an iterated function with a corresponding compression function $F$, and that $h(s) = h(s')$. Let $s''_i$ be the $i$-th character of the string $s''$. By appending the first character of $s''$ to both strings ($s$ and $s'$) we get a collision: $h(s \| s''_1) = F(h(s), s''_1) = F(h(s'), s''_1) = h(s' \| s''_1)$. We can then append the remaining characters of $s''$ one by one starting with $s''_2$ and finally prove that $h(s \| s'') = h(s' \| s'')$.

Conversely, suppose the second point is true. Pick $h \in \mathcal{H}$. We want to construct a corresponding compression function $F$. For any hash value $y$ in the range of $h$, there is at least one string $p_y$ such that $h(p_y) = y$. Let $F(y, a) = h(p_y \| a)$ for all characters $a$. By the second point, $F$ is well defined: its definition is independent of the choice of $p_y$. We can verify that the iterated hash function with compression function $F$ and initial value $H_0 = h(\emptyset)$ agrees with $h$ on all strings, which concludes the proof. □

Families of iterated hash functions have limited independence. The next lemma 
shows that they are pairwise independent at best. Moreover, almost strong universality 
(and thus strong universality) requires a non-fixed initial value. 

Lemma 1 Iterated hashing cannot be 3-wise independent, unless we bound the string 
length to two characters. Moreover, almost strong universality is impossible with a 
fixed initial value unless we bound the string length to one character. 

Proof. We prove the first statement by contradiction. Suppose that an iterated family $\mathcal{H}$ is 3-wise independent. By definition, we must have

$$P(h(a) = y \land h(ab) = y \land h(abb) = y) = \frac{1}{2^{3L}}$$

for any hash value $y$, and any characters $a$ and $b$. (We allow $a = b$.) However, the family is also pairwise independent so that $P(h(a) = y \land h(ab) = y) = 1/2^{2L}$. Yet if $h(a) = y$ and $h(ab) = y$ then the compression function satisfies $F(y, b) = y$ and therefore $h(abb) = y$. Hence we conclude that

$$\frac{1}{2^{2L}} = P(h(a) = y \land h(ab) = y) = P(h(a) = y \land h(ab) = y \land h(abb) = y) = \frac{1}{2^{3L}},$$

a contradiction when $L \geq 1$.

For the second statement, suppose that the family is $\epsilon$-almost strongly universal for $\epsilon < 1$. Let the fixed initial value be $H_0$. By definition, we have $P(h(a) = H_0 \land h(aa) = H_0) \leq \epsilon/2^L$. Moreover, because almost strong universality implies uniformity, we have that $P(h(a) = H_0) = 1/2^L$. Because $h$ is iterated, we have that $h(a) = H_0$ implies that the compression function satisfies $F(H_0, a) = H_0$. Hence we have that $h(a) = H_0$ implies $h(aa) = H_0$. It follows that $P(h(a) = H_0 \land h(aa) = H_0) = P(h(a) = H_0) = 1/2^L$ and therefore the family cannot be $\epsilon$-almost strongly universal because $1/2^L > \epsilon/2^L$. □

To allow better independence, we consider generalized iterated hash functions where a new compression function is used for each new character: $H_i = F_i(H_{i-1}, s_i)$ for $i = 1, 2, \ldots, n$. A family of generalized hash functions is such that whenever $h(s) = h(s')$ for a pair of strings $s, s'$ having the same length, then $h(s \| s'') = h(s' \| s'')$ for any string $s''$.

It includes hashing by multilinear functions over finite fields [4, 28], e.g., hash functions of the form $h(s) = m_1 + \sum_i m_{i+1} s_i$ with randomly generated values $m_1, m_2, \ldots$ (henceforth MULTILINEAR). The compression functions are $F_i(y, c) = y + m_{i+1} c$ with an initial value of $m_1$. The computation is in the finite field $\mathbb{F}_p$: characters are mapped to elements of $\mathbb{F}_p$. For example, we can choose the field of cardinality $p = 2^L$: the polynomials with binary coefficients modulo $p(x)$, where $p(x)$ is an irreducible polynomial of degree $L$. We write $\mathbb{F}_{2^L} = GF(2)[x]/p(x)$. That is, integers in $[0, 2^L)$ are represented as polynomials of degree at most $L - 1$ having binary coefficients. Addition or subtraction is the bitwise (or term-wise) exclusive OR. Multiplication by $x$ is just the left shift, unless the left-most bit is 1, in which case the left shift must be followed by the addition with $p(x)$. Exhaustive lists of irreducible polynomials are available online [27]. Otherwise, when $p$ is a prime number, we merely have to compute $F_i(y, c) = y + m_{i+1} c \bmod p$ using the usual integer algebra. We might prefer finite fields that have prime cardinality close to $2^L$. For example, we can set $p$ to some Mersenne prime such as $2^{17} - 1$, $2^{31} - 1$ or $2^{61} - 1$, or other convenient prime such as $2^{32} - 5$ or $2^{64} - 59$ [12].
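
Both arithmetic options can be sketched briefly in Python. The function names and the encoding of $p(x)$ as an integer bit mask (including its $x^L$ term) are assumptions made for this illustration; the multilinear function follows the form $h(s) = m_1 + \sum_i m_{i+1} s_i$ with `m[0]` playing the role of $m_1$.

```python
def gf2_mul_x(y: int, L: int, poly: int) -> int:
    """Multiply y by x in GF(2)[x]/p(x); poly encodes p(x), x^L term included."""
    y <<= 1
    if (y >> L) & 1:   # the left-most bit was 1: add (XOR) p(x) to reduce
        y ^= poly
    return y

def multilinear_hash(s, m, p):
    """Compute m[0] + sum_i m[i+1]*s[i] mod p one character at a time."""
    h = m[0] % p                   # initial value m_1
    for i, c in enumerate(s):
        h = (h + m[i + 1] * c) % p  # F_i(y, c) = y + m_{i+1} c mod p
    return h

# Example in GF(4) with p(x) = x^2 + x + 1 (bit mask 0b111): x * x = x + 1.
x_squared = gf2_mul_x(0b10, 2, 0b111)

# Example over the Mersenne prime 2^31 - 1 with arbitrary illustrative keys.
p = 2 ** 31 - 1
keys = [5, 1, 4]
value = multilinear_hash([2, 3], keys, p)
```

A trailing zero character leaves the hash value unchanged in this sketch, since it contributes $m_{i+1} \cdot 0$ to the sum.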

Lemma 2 MULTILINEAR is pairwise independent if we forbid strings ending with the value zero.

Proof. We have that $h(a0) = h(a)$ so universality is impossible if we allow strings to end with the value zero. So let $s, s'$ be two distinct strings of lengths $|s|$ and $|s'|$ ending with non-zero values. Assume without loss of generality that $|s| \geq |s'|$. Given $h(s') = y'$, we can solve for $m_1$ as a function of $y'$, $s'$ and the values $m_2, m_3, \ldots, m_{|s'|+1}$.

If $s$ is longer than $s'$ (that is $|s| > |s'|$), then we can solve for $m_{|s|+1}$ in $h(s) = y$ as a function of $y$, $s$ and $m_1, m_2, \ldots, m_{|s|}$, because the last character of $s$ is non-zero. In turn, if we substitute our solution for $m_1$, we have $m_{|s|+1}$ as a function of $y$, $y'$, $s$, $s'$, and all $m_i$ for $i \neq 1, |s|+1$.

If $|s| = |s'|$, then there must be some $j$ such that $s_j \neq s'_j$. Hence, we can solve for $m_{j+1}$ in $h(s) - h(s') = y - y'$ as a function of $y$, $y'$, $s$, $s'$ and all $m_i$ for $i \neq 1, j+1$ after substituting the solution for $m_1$ from $h(s') = y'$.

Thus, in either case, among all possible values of $m_1, m_2, \ldots, m_{|s|+1}$, two values are fixed by $h(s) = y \land h(s') = y'$. Hence, we have that $P(h(s) = y \land h(s') = y') = p^{|s|-1}/p^{|s|+1} = 1/p^2$ where $p$ is the cardinality of the field. This proves pairwise independence. □

If we choose zero as an initial value ($m_1 = 0$), then MULTILINEAR is still XOR universal when $p = 2^L$. However, it fails to be pairwise independent. Indeed, given $a, b$ two distinct elements of $\mathbb{F}_p$, we cannot satisfy both $h(a) = m_1 + m_2 a = a$ and $h(b) = m_1 + m_2 b = a$ unless $m_1 \neq 0$.

MULTILINEAR has a nearly optimal memory-universality trade-off. Indeed, Stinson [31] showed that pairwise independent families must have cardinality at least $1 + a(b - 1)$ where $a$ is the number of strings and $b$ is the number of hash values. There are $p^n - 1$ strings of length at most $n$ in $\mathbb{F}_p$ ending with a non-zero value. Thus, any pairwise independent family from strings in $\mathbb{F}_p$ of length bounded by $n$ to elements in $\mathbb{F}_p$ must have at least $1 + (p^n - 1)(p - 1)$ hash functions. That is, its cardinality is in $\Omega(p^{n+1})$. Meanwhile, there are $p^{n+1}$ different hash functions in MULTILINEAR when strings have length at most $n$.

We can have better universality than MULTILINEAR, at the cost of a higher memory usage. Consider sequences of 3-wise independent hash functions $h_i$ from characters to $L$-bit integers. The Zobrist [36, 37] family of string hash functions $h(s) = h_1(s_1) \oplus h_2(s_2) \oplus \cdots \oplus h_n(s_n)$ is 3-wise independent [3, 14, 34]. It is also an example of generalized iterated hashing with the compression function $F_i(y, c) = y \oplus h_i(c)$.

Moreover, it has optimal independence since generalized iterated families are 3-wise independent at best according to the next lemma.
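
A minimal Python sketch of Zobrist hashing follows. It assumes fully random lookup tables for the per-position functions $h_i$, which is a stronger (and simpler to write) assumption than the 3-wise independence required above; the factory name `make_zobrist` and its parameters are hypothetical.

```python
import random

def make_zobrist(alphabet_size: int, max_length: int, L: int, seed: int = 0):
    """Build a Zobrist string hash with one random table per character position."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(L) for _ in range(alphabet_size)]
              for _ in range(max_length)]

    def h(s):
        y = 0
        for i, c in enumerate(s):
            y ^= tables[i][c]   # generalized compression F_i(y, c) = y XOR h_i(c)
        return y

    return h, tables

h, tables = make_zobrist(alphabet_size=4, max_length=8, L=16)
value = h([1, 2])
```

The memory cost is visible in the sketch: one table of $|\Sigma|$ entries is stored for every character position.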

Lemma 3 Generalized iterated hashing cannot be 4-wise independent unless we bound the string length to one character.

Proof. Consider any generalized iterated hash function $h$. If $h(s) = h(s')$ for any two strings $s$ and $s'$ of the same length then $h(s \| a) = h(s' \| a)$ for any character $a$. Hence, assuming that the family is 4-wise independent, we have that

$$\frac{1}{2^{4L}} = P(h(s) = y \land h(s') = y \land h(s \| a) = z \land h(s' \| a) = z') = 0$$

whenever $z \neq z'$, a contradiction. □

In the following sections, we consider solely conventional iterated hashing. 

4. Pairwise independence requires permuting compression functions 

Permuting compression functions $F$ are such that $y \mapsto F(y, c)$ is a permutation of the hash values $y \in [0, 2^L)$ for any character $c$ and any compression function. Hence, if $y \neq y'$ then $F(y, c) \neq F(y', c)$ when $F$ is permuting.

(It is not necessary for $F$ to permute all integer values in $[0, 2^L)$. For example, the hash function mapping all strings to a constant ($h(s) = z$) has a corresponding compression function $F$ which is defined only as $F(z, c) = z$ for all characters $c$. It is trivially permuting over a single hash value $z$.)

We have that XOR universality or pairwise independence implies a fixed collision probability of $1/2^L$ between distinct strings. This, in turn, implies permuting compression functions by the next lemma.

Lemma 4 An iterated hash family with fixed collision probability ($P(h(s) = h(s')) = \epsilon$ for $s \neq s'$) over strings of length two or more has permuting compression functions.

Proof. Consider two distinct strings $s, s'$ and any character $c$. Consider any iterated hash family with fixed collision probability $\epsilon$. We have

$$\begin{aligned}
\epsilon &= P(h(s \| c) = h(s' \| c)) \\
&= P(h(s) = h(s')) + P(h(s) \neq h(s') \land h(s \| c) = h(s' \| c)) \\
&= \epsilon + P(h(s) \neq h(s') \land h(s \| c) = h(s' \| c)).
\end{aligned}$$

Thus, we have $h(s) \neq h(s') \Rightarrow h(s \| c) \neq h(s' \| c)$ which proves the result. □

Table 2: Universality of some iterated families: $n$ is the maximal string length.

family                      universality
CWPOLY                      $n/2^L$-almost XOR universal
TABULATED                   pairwise independent for $n \leq L$
SHIFTTABULATED              pairwise independent on last $L - n + 1$ bits
Pearson on unary strings    $\max_{i<n} d(i)/2^L$-almost universal

Table 3: Computational complexity (per character) and memory usage in bits of some iterated families. There are $|\Sigma|$ characters in the alphabet.

families                        complexity                          memory usage
CWPOLY                          $O(L \log L \, 2^{O(\log^* L)})$    $L$
TABULATED and SHIFTTABULATED    $O(L)$                              $|\Sigma| L$
Pearson                         $O(L)$                              $\log(2^L!) \leq L 2^L$

Consider the consequences of this lemma for $L = 1$. Over $\{0, 1\}$, there are only two permutations: the identity and an exchange of the two values (0 and 1). Hence, we have that $F(F(y, a), b) = F(F(y, b), a)$ if $F$ is permuting. Therefore, the strings $ab$ and $ba$ always collide ($h(ab) = h(ba)$). Thus, in general, XOR universality or pairwise independence over strings longer than $L$ characters is impossible.
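
The case $L = 1$ is small enough to enumerate. The following Python sketch, with hypothetical names and a two-character alphabet chosen for the example, tries every permuting compression function over $\{0, 1\}$ and every initial value.

```python
from itertools import product

IDENTITY, SWAP = (0, 1), (1, 0)   # the two permutations of {0, 1}

def hash_string(perm_of, h0, s):
    """Iterated hash where each character c applies the permutation perm_of[c]."""
    y = h0
    for c in s:
        y = perm_of[c][y]
    return y

# Assign a permutation to each of the characters 'a' and 'b' in every possible way.
always_collide = all(
    hash_string({"a": pa, "b": pb}, h0, "ab")
    == hash_string({"a": pa, "b": pb}, h0, "ba")
    for pa, pb in product((IDENTITY, SWAP), repeat=2)
    for h0 in (0, 1)
)
```

The enumeration confirms that `ab` and `ba` collide in every case, because the two permutations of a two-element set commute.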

However, permuting compression functions have benefits on their own, besides being a consequence of pairwise independence. Consider any fixed permuting compression function $F$ and any string $s$. Consider any two distinct initial values $H_0$ and $H'_0$. Then the hash value of $s$ computed with $H_0$ must differ from the hash value computed with $H'_0$, by induction on the number of characters in the string $s$. Thus, we have the following result. It holds true for strings of arbitrary length.

Lemma 5 An iterated hash family with permuting compression functions and indepen- 
dently chosen equiprobable initial values is uniform. 

A compression function is strongly permuting if it is permuting and if $F(y, c) = F(y, c')$ implies $c = c'$. Strong permutation means that strings having a Hamming distance of one never collide. Of course, this precludes pairwise independence.

Lemma 6 Given a strongly permuting iterated hash family, two strings differing by 
exactly one character never collide. 

5. Iterated hash families over variable-length strings 

We are interested in hashing variable-length strings using iterated hash functions. Let $\Sigma$ be the set of all characters from which the strings are constructed; the number of distinct characters is $|\Sigma|$. We present a range of iterated families (see Tables 2 and 3). Other hash families appear in Appendix A.

5.1. Carter-Wegman Polynomial Hashing 

Carter and Wegman [4] defined an almost universal family of iterated hash functions (henceforth CWPOLY) using polynomials over a finite field $\mathbb{F}_p$ as $h(s) = \sum_{i=1}^{n} t^{n-i} s_i$ for $t \in \mathbb{F}_p$ randomly chosen and where $s_i$ is the $i$-th character of the string $s$ of length $n$. In other words, we use the compression function $F(y, c) = ty + c$. Characters are interpreted as elements of $\mathbb{F}_p$. We are especially interested in binary fields where $p = 2^L$.

Unfortunately, CWPOLY with an initial value of zero is such that $h(00) = h(0) = 0$. To fix this problem, we need to choose a non-zero initial value [12] such as 1. Thus, we have $h(00) = t^2$, $h(0) = t$ and $h(223) = t^3 + 2t^2 + 2t + 3$. Therefore, given two distinct strings $s$ and $s'$, $h(s) - h(s')$ is a non-zero polynomial in $t$ of degree at most $\max(|s|, |s'|)$ where $|s|$ and $|s'|$ are the lengths of strings $s$ and $s'$. By the Fundamental Theorem of Algebra, such a polynomial has at most $\max(|s|, |s'|)$ roots. Thus, the probability of collision between two strings of length at most $n$ is at most $n/p$. This probability bound is tight. Indeed, consider the polynomial of degree $n$ over $\mathbb{F}_p$, $\chi(t) = \prod_{i=0}^{n-1} (t - i)$. E.g., $\chi(t) = t^3 + 2t$ for $n = 3$ in $\mathbb{F}_3$. It has the $n$ distinct roots $0, 1, \ldots, n - 1$. If $s$ is the character 0 repeated $n$ times, and $s'$ is the string corresponding to the coefficients of the polynomial $\chi(t)$, we have that $h(s) - h(s') = \chi(t)$, hence $P(h(s) = h(s')) = n/p$. Moreover, Nguyen and Roscoe show that CWPOLY has optimal universality given the size of its family [18]. Sadly, CWPOLY is not uniform, but the probability of any hash value $y$ is bounded: $P(h(s) = y) \leq n/p$ for any string $s$ of length $n$.
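
The effect of the initial value can be observed by brute force over a small prime field. In this Python sketch the field size $p = 7$, the function names and the sample strings are assumptions chosen for illustration.

```python
from fractions import Fraction

def cwpoly(s, t, p, h0=1):
    """Polynomial hashing with compression function F(y, c) = t*y + c mod p."""
    y = h0
    for c in s:
        y = (t * y + c) % p
    return y

def collision_probability(s1, s2, p, h0=1):
    """Fraction of parameters t in F_p for which s1 and s2 collide."""
    hits = sum(1 for t in range(p)
               if cwpoly(s1, t, p, h0) == cwpoly(s2, t, p, h0))
    return Fraction(hits, p)

p = 7
zero_seed = collision_probability([0, 0], [0], p, h0=0)   # h(00) = h(0) = 0
unit_seed = collision_probability([0, 0], [0], p, h0=1)   # t^2 = t
```

With a zero initial value the two strings always collide. With an initial value of 1 they collide only when $t^2 = t$, that is for $t \in \{0, 1\}$, which meets the bound $n/p$ with $n = 2$.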

When the field size is a power of two ($p = 2^L$), then CWPOLY is $n/2^L$-almost XOR universal. Indeed, we have that $P(h(s) \oplus h(s') = y)$ (for any $y$) is given by the probability that $h(s) + h(s') = y$ as polynomials in $GF(2)[x]/p(x)$. Yet $h(s) + h(s')$ is a non-zero polynomial of degree at most $\max(|s|, |s'|)$ in $t$ and the result follows again by the Fundamental Theorem of Algebra.

Unfortunately, CWPOLY cannot be almost strongly universal over variable-length strings even if we use a random initial value. Indeed, consider the equations $h(aa) = a$ and $h(a) = 0$. They can be written explicitly as $H_0 t^2 + at + a = a$ and $H_0 t + a = 0$ where $H_0$ is the initial value. We see that $h(a) = 0$ implies $h(aa) = a$. Therefore, if CWPOLY was $\epsilon$-almost strongly universal for $\epsilon < 1$, we would have that $\epsilon/2^L \geq P(h(aa) = a \land h(a) = 0) = P(h(a) = 0) = 1/2^L$, a contradiction. It is still possible to modify CWPOLY so that it becomes almost strongly universal. Unfortunately, the result is not an iterated family by our definition. To get this stronger property, we append to each string a random character before hashing. We state the general result over $\mathbb{F}_p$, but it obviously applies when $p = 2^L$.

Lemma 7 Consider the family CWPOLY with the parameter $t$ chosen randomly among the non-zero values of $\mathbb{F}_p$ and an initial value of 1. Moreover, we choose a random value $\zeta$ in $\mathbb{F}_p$ and append it to all strings before hashing them:

$$h(s) = t^{|s|+1} + \sum_{i=1}^{|s|} t^{|s|+1-i} s_i + \zeta.$$

If we consider strings of length at most $n$, then this modified CWPOLY is $(n+1)/(p-1)$-almost strongly universal.

Proof. With the random parameter $\zeta$, CWPOLY is clearly uniform.

It remains to show that

$$P(h(s) = y \land h(s') = y') \leq \frac{n+1}{(p-1)p}$$

for any two distinct strings $s, s'$ and any hash values $y, y'$. We have that $h(s) - h(s') - (y - y')$ is a non-zero polynomial of degree at most $n + 1$ in $t$. (E.g., if $s = ab$ and $s' = cd$ then $h(s) - h(s') - (y - y') = t^2(a - c) + t(b - d) - (y - y')$.) To see why it must be a non-zero polynomial, consider two cases. If $s$ and $s'$ have the same length, let $i$ be such that $s_i \neq s'_i$; then $h(s) - h(s') - (y - y')$ as a polynomial over the variable $t$ has the non-zero coefficient $s_i - s'_i$ on $t^{|s|+1-i}$. If $s$ and $s'$ have different lengths, assume without loss of generality that $s$ is longer; then the leading term of the polynomial is $t^{|s|+1}$ and therefore the polynomial has degree $|s| + 1$ and is non-zero. Hence, the polynomial has at most $n + 1$ roots. Moreover, we have that $h(s) - h(s') - (y - y')$ is independent from $\zeta$. Given any $t$ such that $h(s) - h(s') - (y - y') = 0$ is satisfied, there is only one value $\zeta$ (dependent on $t$) such that $h(s) = y$. Thus there are at most $n + 1$ pairs of values $t, \zeta$ such that $t \neq 0$, $h(s) - h(s') = y - y'$ and $h(s) = y$ are satisfied. Therefore, the probability that $h(s) = y$ and $h(s') = y'$ are both true is bounded by $\frac{n+1}{(p-1)p}$, because there are $p - 1$ possible non-zero values for $t$ and $p$ possible values for $\zeta$. □



When $L$ bits fit in a processor register, the running time of the compression function $F$ may be considered independent of $L$. More formally, however, the multiplication between two $L$-bit integers required by the compression function is in $O(L \log L \, 2^{O(\log^* L)})$.



5.2. Iterated string hashing by tabulation 

Hashing by tabulation [3, 4, 14, 34] has good universality, at the expense of the memory usage. We adapt this strategy to iterated hashing of variable-length strings.

Consider $T$, a randomly chosen function from characters in $\Sigma$ to $L$-bit hash values. There are $2^{L|\Sigma|}$ such functions. We consider integers as elements of $\mathbb{F}_{2^L}$, that is, as polynomials with binary coefficients in $GF(2)[x]/p(x)$ where $p(x)$ is an arbitrarily chosen irreducible polynomial of degree $L$. Consider the family of iterated hash functions with compression functions of the form $F(y, c) = xy + T(c)$ and an initial value chosen randomly (henceforth TABULATED). The compression function is permuting. TABULATED is pairwise independent over sufficiently short strings.

Proposition 2 TABULATED is pairwise independent for strings no longer than $L$ characters.

Proof. Consider two distinct strings $s, s'$ no longer than $L$ characters. We want to show that $P(h(s) = y \land h(s') = y') = 1/2^{2L}$ for any hash values $y, y'$. Consider the equation $h(s) = y$ or

$$H_0 x^{|s|} + \sum_{i=1}^{|s|} x^{|s|-i} T(s_i) = y$$

where $H_0$ is the initial value. We can solve for the initial value as a function of $y$ and $s$:

$$H_0 = x^{-|s|} \left( y - \sum_{i=1}^{|s|} x^{|s|-i} T(s_i) \right). \qquad (1)$$

We solve for $T(s_j)$ for one value of $j$, in terms that do not depend on $H_0$. This will allow us to conclude the proof. We consider two cases:

• Suppose that the strings have the same length ($|s| = |s'|$). From $h(s) = y$ and $h(s') = y'$, we get that $h(s) - h(s') = y - y'$. We have that the equation $h(s) - h(s') = y - y'$ is independent from $H_0$ because the terms $H_0 x^{|s|}$ and $H_0 x^{|s'|}$ cancel out. Because the strings are distinct, there must be an index $j \in \{1, 2, \ldots, |s|\}$ such that $s_j \neq s'_j$. Let the character $s_j$ occur at indexes $r_1, r_2, \ldots, r_k$ in string $s$ (by definition $j \in \{r_1, r_2, \ldots, r_k\}$) and at indexes $r'_1, r'_2, \ldots, r'_{k'}$ in string $s'$. If we let $q = \sum_m x^{|s| - r_m} - \sum_m x^{|s'| - r'_m}$, the equation $h(s) - h(s') = y - y'$ can be written as $q T(s_j) = \lambda$ for some value $\lambda \in \mathbb{F}_{2^L}$ independent from $T(s_j)$ and $H_0$. Because $j$ is in $\{r_1, r_2, \ldots, r_k\}$ but not in $\{r'_1, r'_2, \ldots, r'_{k'}\}$, and because all $|s| - r_m$'s and all $|s'| - r'_m$'s are less than $L$, we have that $q \neq 0$.

• Suppose that the strings have different lengths. Without loss of generality, assume that $|s| > |s'|$. From $h(s) = y$ and $h(s') = y'$, we get that $(h(s) - y) - x^{|s| - |s'|}(h(s') - y') = 0$. The equation $(h(s) - y) - x^{|s| - |s'|}(h(s') - y') = 0$ is independent from $H_0$ because the terms $H_0 x^{|s|}$ and $H_0 x^{|s'|} \times x^{|s| - |s'|}$ cancel out. Consider the character $s_{|s|}$ and seek all indexes where it appears: write these indexes $r_1, r_2, \ldots, r_k$ for string $s$ (by definition $|s| \in \{r_1, r_2, \ldots, r_k\}$) and $r'_1, r'_2, \ldots, r'_{k'}$ for string $s'$. We have that $q = \sum_m x^{|s| - r_m} - \sum_m x^{|s| - r'_m}$ is non-zero because 0 is in $\{|s| - r_1, |s| - r_2, \ldots, |s| - r_k\}$ but not in $\{|s| - r'_1, |s| - r'_2, \ldots, |s| - r'_{k'}\}$ since $r'_m \leq |s'| < |s|$ for all $m$'s, and because the $|s| - r_m$'s and the $|s| - r'_m$'s are less than $L$. For the rest of the proof, we set $j = |s|$.

Hence we can solve for $T(s_j)$ as $T(s_j) = q^{-1} \lambda$ whether the two strings have the same length or not. Equation (1) gives $H_0$ as a function of $T(s_i)$ for $i \in \{1, 2, \ldots, |s|\}$. So, our formula for $H_0$ depends on $T(s_j)$, but we can substitute $T(s_j) = q^{-1} \lambda$ in this formula (Equation (1)) to get an expression for $H_0$ which does not depend on $T(s_j)$. Thus, from the equations $h(s) = y$ and $h(s') = y'$, we get one and only one value for $H_0$ and $T(s_j)$ as a function of the other tabulated values and of $y$ and $y'$. Both values are chosen at random among $2^L$ values and thus the result is shown. □

For binary strings, TABULATED has nearly optimal memory-universality trade-off. Indeed, there are $2^{L+1} - 2$ non-empty binary strings of length at most $L$. Hence, any pairwise independent family over such binary strings must contain at least $1 + (2^{L+1} - 2)(2^L - 1)$ hash functions [31]. Therefore, its cardinality must be in $\Omega(2^{2L})$. Meanwhile, TABULATED has $2^{3L}$ hash functions over binary strings: there are $2^L$ choices for each of $H_0$, $T(0)$ and $T(1)$.
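
For a tiny field, the pairwise independence of this family can be verified exhaustively. The Python sketch below assumes $L = 2$, the irreducible polynomial $x^2 + x + 1$ and a binary alphabet; these choices and all names are made for illustration only.

```python
from collections import Counter
from itertools import product

L, POLY = 2, 0b111   # GF(4) = GF(2)[x]/(x^2 + x + 1)

def mul_x(y):
    """Multiply y by x in GF(2)[x]/(x^2 + x + 1)."""
    y <<= 1
    return y ^ POLY if (y >> L) & 1 else y

def tabulated(h0, table, s):
    """Iterated hash with compression function F(y, c) = x*y + T(c)."""
    y = h0
    for c in s:
        y = mul_x(y) ^ table[c]
    return y

# All binary strings of length at most L (the empty string included).
strings = [s for n in range(L + 1) for s in product((0, 1), repeat=n)]
# Every choice of initial value H_0 and table entries T(0), T(1).
family = [(h0, (t0, t1)) for h0, t0, t1 in product(range(2 ** L), repeat=3)]

def is_pairwise_independent():
    """Check that every pair of hash values occurs equally often for distinct strings."""
    expected = len(family) // (2 ** L) ** 2
    for s1, s2 in product(strings, repeat=2):
        if s1 == s2:
            continue
        counts = Counter((tabulated(h0, t, s1), tabulated(h0, t, s2))
                         for h0, t in family)
        if len(counts) != (2 ** L) ** 2 or any(v != expected for v in counts.values()):
            return False
    return True

result = is_pairwise_independent()
```

Each of the 16 pairs of hash values is produced by exactly 4 of the 64 functions, for every pair of distinct strings of length at most 2.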

5.3. Iterated string hashing by shifted tabulation 

While TABULATED is pairwise independent, it requires operations in finite fields of cardinality $2^L$. Thankfully some microprocessors have instructions for computations in such finite fields [8]. Yet it can be expensive on some computers, even with such instructions. Fortunately, pairwise independence is possible without finite fields [6].

The barrel or circular shift is the invertible operation by which all bits are shifted, except for the last ones, which are brought back at the beginning. For $L$-bit values, the barrel left shift by one can be written as $y \circlearrowleft 1 = (y \ll 1) \mid (y \gg L - 1)$, keeping only the $L$ least significant bits, where $\ll$ and $\gg$ are the left and right shifts. E.g., 11001 becomes 10011. Barrel shifting can be implemented efficiently in hardware [1]. The popular x86 and ARM instruction sets offer the ror instruction for this purpose.

Consider the hash family (henceforth SHIFTTABULATED) with compression functions of the form $F(y, c) = (y \circlearrowleft 1) \oplus T(c)$ where $T$ is a randomly chosen function from characters to $L$-bit hash values. (Whether we choose the barrel left or right shift is arbitrary.) We choose the initial value randomly. Because the compression function is permuting, SHIFTTABULATED is uniform.

The SHIFTTABULATED compression functions can be computed efficiently: one value to look up, one barrel shift and one bitwise XOR. Moreover, SHIFTTABULATED can be described using the same compression function as TABULATED, $F(y, c) = xy + T(c)$, in $GF(2)[x]/(x^L + 1)$. (The polynomial $x^L + 1$ fails to be irreducible and thus, $GF(2)[x]/(x^L + 1)$ is merely a ring.)
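
The barrel shift and the resulting compression function are easy to sketch in Python, where integers are unbounded and the $L$-bit width must be enforced with a mask. The names `rotl1` and `shift_tabulated` and the sample table are hypothetical.

```python
def rotl1(y: int, L: int) -> int:
    """Barrel left shift of an L-bit value by one position."""
    mask = (1 << L) - 1
    return ((y << 1) & mask) | (y >> (L - 1))

def shift_tabulated(h0: int, table, s, L: int) -> int:
    """Iterated hash with compression function F(y, c) = rotl1(y) XOR T(c)."""
    y = h0
    for c in s:
        y = rotl1(y, L) ^ table[c]
    return y

rotated = rotl1(0b11001, 5)   # the example from the text: 11001 becomes 10011
```

Each character costs one table look-up, one rotation and one XOR in this sketch.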

The proof of the pairwise independence of TABULATED (see Proposition 2) relies on the fact that any non-zero element of the field $GF(2)[x]/p(x)$ is invertible, which includes any non-zero polynomial of degree at most $L - 1$ from $GF(2)[x]$. In turn, this means that given any non-zero polynomial $q$ of degree at most $n - 1 < L$, and any element $r$ of the field, we have

$$P(q T(s_j) = r) = 2^{-L}$$

whenever $T(s_j)$ is picked at random in the field, because the equation is only true when $T(s_j) = q^{-1} r$. The same is almost true in $GF(2)[x]/(x^L + 1)$. We write that two values are equal modulo the first $n - 1$ bits if we ignore the first $n - 1$ bits in the comparison. By Corollary 1 from Lemire and Kaser (2010) [14], we have that

$$P(q T(s_j) = r \bmod \text{first } n - 1 \text{ bits}) = 2^{-L+n-1}.$$

Hence, by a proof similar to Proposition 2, we have the following result.

Lemma 8 SHIFTTABULATED is pairwise independent on the last $L - n + 1$ bits for strings no longer than $n$ characters.

5.4. Pearson hashing 

We define Pearson hashing by the family of compression functions $F(y, c) = A_{y \oplus c}$ where $A$ is an array containing a permutation of the values in $\{0, 1, \ldots, 2^L - 1\}$ [21]. These compression functions are strongly permuting: $A_{y \oplus c} = A_{y' \oplus c}$ implies $y = y'$, $A_{y \oplus c} = A_{y \oplus c'}$ implies $c = c'$. We pick the initial value uniformly at random. Thus, Pearson is uniform. To our knowledge, the exact universality of Pearson remains unknown. (We know that it can never be strongly universal because its compression function is strongly permuting.)
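
A short Python sketch of Pearson hashing follows. The factory name `make_pearson`, the seeded generator and the sample strings are assumptions for the example; characters are taken to be integers in $[0, 2^L)$.

```python
import random

def make_pearson(L: int, seed: int = 0):
    """Build a Pearson hash from a random permutation A and a random initial value."""
    rng = random.Random(seed)
    A = list(range(2 ** L))
    rng.shuffle(A)                 # A is a permutation of {0, ..., 2^L - 1}
    h0 = rng.randrange(2 ** L)

    def h(s):
        y = h0
        for c in s:
            y = A[y ^ c]           # compression function F(y, c) = A[y XOR c]
        return y

    return h

h = make_pearson(8, seed=1)
reference = h([1, 2, 3])
# Strings at Hamming distance one from [1, 2, 3], changing the middle character.
neighbours = [h([1, c, 3]) for c in range(256) if c != 2]
```

Because the compression function is strongly permuting, none of the values in `neighbours` can equal `reference`.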



For $L = 2$, Pearson is 5/6-almost universal for strings no longer than four. That is, it is almost universal for strings of length $2^L$ unlike CWPOLY. Brute-force numerical investigations for large values of $L$ are difficult.

To simplify the analysis, we focus on unary strings: strings made of a single character
(such as aaaa). The following result is an upper bound on the universality of
Pearson over general strings.

Proposition 3 Pearson is $\epsilon$-almost universal over unary strings of length at most $n$ for
$\epsilon = \max_{i<n} d(i)/2^L$ where $d(i)$ is the divisor function: the number of positive integers
dividing $i$.

Proof. Consider strings made of the character $a$. Let $\pi$ be the permutation $A_{\cdot \oplus a}$.
Fix the initial value $H_0$. A collision between two unary strings of lengths $k, k' \leq n$ is
equivalent to the equation $\pi^k H_0 = \pi^{k'} H_0 \Rightarrow \pi^l H_0 = H_0$ for $|k - k'| = l < n$. Consider
any solution $\pi$ of this equation ($\pi^l H_0 = H_0$), then let $\sigma$ be the smallest positive integer such
that $\pi^{\sigma} H_0 = H_0$. We bound the number of solutions using the fact that $\sigma$ must divide $l$.

Given $\sigma$, there are $(2^L - 1)!$ solutions to the equation $\pi^{\sigma} H_0 = H_0$ subject to $\pi^i H_0 \neq H_0$
for $0 < i < \sigma$. Indeed, we have that $\pi H_0$ can be any value, except $H_0$: thus we have
$2^L - 1$ possibilities. We have that $\pi^2 H_0$ can be any value except $H_0$ and $\pi H_0$, hence
we have $2^L - 2$ possibilities. And so on, up to $\pi^{\sigma} H_0$ which is predetermined. At that
point, we have enumerated $(2^L - 1)(2^L - 2) \cdots (2^L - \sigma + 1)$ possibilities. Each one of
these possibilities defines how the values $H_0, \pi H_0, \ldots, \pi^{\sigma-1} H_0$ are permuted. The other
$2^L - \sigma$ values can be permuted to any available value, generating $(2^L - \sigma)!$ possibilities,
for a total of $(2^L - 1)!$ possibilities.

Thus there is a total of $d(l)(2^L - 1)!$ solutions and $(2^L)!$ different permutations: the
ratio is

$\frac{d(l)(2^L - 1)!}{(2^L)!} = \frac{d(l)}{2^L}$.

This is true for any initial value $H_0$. The string length difference $l$ ranges between 1
and $n - 1$: we must keep the maximum value $\max_{i<n} d(i)/2^L$. □
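The counting argument in this proof can be checked numerically for tiny parameters. The sketch below enumerates every permutation $\pi$ of num_values elements and measures how often $\pi^l H_0 = H_0$; the function names are hypothetical. For $l$ no larger than the number of hash values, the measured fraction should equal $d(l)/2^L$. For larger $l$ it can only be smaller, because a permutation has no cycle longer than the number of values, so $d(l)/2^L$ then acts as an upper bound.

```python
from fractions import Fraction
from itertools import permutations

def divisor_count(l):
    """d(l): the number of positive integers dividing l."""
    return sum(1 for k in range(1, l + 1) if l % k == 0)

def return_fraction(num_values, l, h0=0):
    """Fraction of permutations pi of range(num_values) with pi^l(h0) == h0."""
    hits = total = 0
    for pi in permutations(range(num_values)):
        y = h0
        for _ in range(l):
            y = pi[y]
        hits += (y == h0)
        total += 1
    return Fraction(hits, total)
```

For example, return_fraction(4, l) can be compared with Fraction(divisor_count(l), 4) for l = 1, 2, 3, 4 (the case $L = 2$).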

The function $\max_{i<n} d(i)$ grows slowly (see Fig. 2). Formally, we have that
$\max_{i<n} d(i) \in o(n^{\epsilon})$ for all $\epsilon > 0$. For the collision probability to reach one, so that Pearson
is no longer almost universal over unary strings, every cycle length from 1 to $2^L$ must divide
the length difference: we need $n$ to be larger than the least common multiple of the integers
from 1 to $2^L$. Thus, Pearson can be almost universal
over unary strings much longer than $2^L$ characters.
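To get a feel for these quantities, the short sketch below computes $\max_{i<n} d(i)$ with a sieve and the least common multiple of the integers from 1 to $k$. The function names are illustrative assumptions.

```python
from math import gcd

def divisor_counts(limit):
    """Return d where d[i] is the number of positive divisors of i, 0 < i <= limit."""
    d = [0] * (limit + 1)
    for k in range(1, limit + 1):
        for multiple in range(k, limit + 1, k):
            d[multiple] += 1
    return d

def max_divisor_count_below(n):
    """max of d(i) over 0 < i < n (requires n >= 2)."""
    return max(divisor_counts(n - 1)[1:])

def lcm_up_to(k):
    """Least common multiple of the integers 1, 2, ..., k."""
    result = 1
    for i in range(2, k + 1):
        result = result * i // gcd(result, i)
    return result
```

Comparing max_divisor_count_below(n) with lcm_up_to(2 ** L) for small $L$ shows how much faster the least common multiple grows.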

To prove that almost universal iterated hashing over non-unary strings longer than $2^L$
characters is possible, we consider the following variation on Pearson hashing (henceforth
GENERALIZED PEARSON): we pick compression functions of the form $F(y,c) = A_{y \oplus c}$ where
$A$ is a random array containing values in $\{0, 1, \ldots, 2^L - 1\}$ (not necessarily a permutation).
For $L = 1$, we have that $h(00) = h(11)$ with probability one. However, for $L = 2$,
GENERALIZED PEARSON is almost universal for strings longer than $2^L$ (see Table 4).
We computed these probabilities by enumerating all possible compression functions
and all possible pairs of strings with $L$-bit characters.
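The $L = 1$ claim is small enough to verify exhaustively. The sketch below enumerates every array $A$ and every initial value, in the spirit of the enumeration described above; it is not the authors' code, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import product

def generalized_pearson_hash(string, H0, A):
    """Iterate F(y, c) = A[y XOR c]; A need not be a permutation."""
    y = H0
    for c in string:
        y = A[y ^ c]
    return y

def collision_probability(s1, s2, L):
    """Exact collision probability of s1 and s2 over all arrays A and initial values."""
    size = 1 << L
    hits = total = 0
    for A in product(range(size), repeat=size):
        for H0 in range(size):
            hits += (generalized_pearson_hash(s1, H0, A)
                     == generalized_pearson_hash(s2, H0, A))
            total += 1
    return Fraction(hits, total)
```

With $L = 1$, collision_probability([0, 0], [1, 1], 1) evaluates to one, whereas the same pair of strings no longer always collides for $L = 2$.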
Figure 2: Plot of $\max_{i<n} d(i)$ where $d$ is the divisor function



6. Bounding the universality of iterated hashing by the maximal family size 

Given only the number of hashable items, the number of hash values and the number
of hash functions, we can bound the universality. To be $\epsilon$-almost universal, a family
must have enough hash functions.

There is a limited number of iterated hash functions. Compression functions have
$2^L |\Sigma|$ possible inputs. For each input, there are $2^L$ possible hash values. Thus, there are
no more than $2^{L 2^L |\Sigma|}$ compression functions. Moreover, there are no more than $2^L$ initial
values $H_0$. Thus the number of iterated hash functions is bounded by $2^{L(2^L |\Sigma| + 1)}$:

$|\mathcal{H}| \leq 2^{L(2^L |\Sigma| + 1)}$.    (2)

There are $|\Sigma|^n + \cdots + |\Sigma|$ possible non-empty strings of length at most $n$.
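These counts are easy to reproduce for tiny parameters. The sketch below (hypothetical function names) computes the number of non-empty strings, the exponent of the bound on the family size, and an explicit enumeration of all pairs made of an initial value and a compression table. Distinct pairs may define the same function on strings, so the enumeration is an upper bound on the number of distinct hash functions.

```python
from itertools import product

def count_nonempty_strings(sigma, n):
    """Number of non-empty strings of length at most n over sigma characters."""
    return sum(sigma ** k for k in range(1, n + 1))

def log2_family_bound(L, sigma):
    """Base-2 logarithm of the bound 2^(L (2^L sigma + 1)) on the family size."""
    return L * ((1 << L) * sigma + 1)

def count_iterated_functions(L, sigma):
    """Enumerate every (initial value, compression table) pair; tiny L only."""
    size = 1 << L
    tables = product(range(size), repeat=size * sigma)
    return sum(size for _ in tables)
```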

Nguyen and Roscoe derived a bound which is particularly suited for values of $\epsilon$
larger than $1/2^L$ [18]. Let $2^K$ be the number of hashable items. Pick any hash function.
By the pigeon-hole principle, there must be at least $2^K/2^L$ hashable items colliding to
the same value. Apply any other hash function to these colliding items: there must
be $2^K/2^{2L}$ items colliding on these two hash functions. By repeating this argument,
Nguyen and Roscoe show that $\epsilon$-almost universal hashing requires $\lceil K/L - 1 \rceil/\epsilon$ hash
functions [18].

Corollary 1 (Nguyen and Roscoe) Given at least $2^K$ hashable items, $\epsilon$-almost
universal hashing for $\epsilon > 1/2^L$ requires at least $\lceil K/L - 1 \rceil/\epsilon$ hash functions.

Proof. Let $|\mathcal{H}| = 2^r$ be the number of hash functions. Theorem 1 from Nguyen and Roscoe [18]
states that when $K$ is a multiple of $L$ then $r \geq \log(\epsilon^{-1}(K/L - 1))$, and $r \geq \log(\epsilon^{-1} \lfloor K/L \rfloor)$
otherwise. Their result may be rewritten as $r \geq \log(\epsilon^{-1} \lceil K/L - 1 \rceil)$, irrespective of the
value of $K$. Thus, we have $|\mathcal{H}| = 2^r \geq \lceil K/L - 1 \rceil/\epsilon$. It proves the result. □



Table 4: Numerically-derived upper bound on the collision probability between strings
of length at most $n$ under GENERALIZED PEARSON ($L = 2$, $2^L = 4$)

n            collision probability
2            0.53
3            0.72
4 = 2^L      0.84
5            0.88
6            0.89
7            0.95
8            > 0.97
9            > 0.98
10           > 0.99
11           1.00


By combining this last corollary with our bound on the number of iterated hash
functions (see Equation 2), we get the following bound on the universality of iterated
hashing.

Lemma 9 At best, iterated hashing might be $\epsilon$-almost universal over the strings of
length at most $L(\epsilon 2^{L(2^{L+1}+1)} + 1)$ for some $\epsilon < 1$.

PROOF. According to Corollary 1, we have that $|\mathcal{H}| \geq \lceil K/L - 1 \rceil/\epsilon$. Solving for $K$
in this expression, we get $K \leq L(\epsilon|\mathcal{H}| + 1)$. Meanwhile, for strings of length at most $n$
over $|\Sigma|$ distinct characters, we have $|\Sigma|^n \leq 2^K$ or $n \log|\Sigma| \leq K$. Hence, by combining
these two inequalities, we have $n \log|\Sigma| \leq K \leq L(\epsilon|\mathcal{H}| + 1)$ or just

$n \log|\Sigma| \leq L(\epsilon|\mathcal{H}| + 1)$.

Moreover, we can bound the size of an iterated family as $|\mathcal{H}| \leq 2^{L(2^L|\Sigma|+1)}$ (see
Equation 2). Hence, by substitution, we have

$n \log|\Sigma| \leq L(\epsilon|\mathcal{H}| + 1) \leq L(\epsilon 2^{L(2^L|\Sigma|+1)} + 1)$.

Finally, we can solve for $n$ in this inequality. Thus, at best, iterated hashing might
be almost universal over strings of length at most $L(\epsilon 2^{L(2^L|\Sigma|+1)} + 1)/\log|\Sigma|$ where
$|\Sigma| > 1$. This bound grows exponentially with $|\Sigma|$, which is misleading because
universality over a large alphabet ($|\Sigma|$ large) implies universality over a smaller
alphabet. Indeed, it is always possible to restrict the application of a universal family
to strings using few characters, and this restriction may only increase the universality.
Thus, it is preferable to set $|\Sigma| = 2$. (For $|\Sigma| = 1$, we get a weaker bound: we have that
$2^K \geq n$ and $|\mathcal{H}| \leq 2^{L(2^L+1)}$ so that the bound becomes $n \leq 2^{L(\epsilon 2^{L(2^L+1)}+1)}$.) □
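To see how loose this cardinality-based bound is, one can evaluate it for small parameters. The sketch below assumes a binary alphabet ($|\Sigma| = 2$, so that $\log|\Sigma| = 1$ in base 2) and uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def nguyen_roscoe_min_family_size(K, L, eps):
    """Lower bound ceil(K/L - 1)/eps on the number of hash functions."""
    return Fraction(ceil(Fraction(K, L) - 1)) / Fraction(eps)

def lemma9_length_bound(L, eps):
    """L (eps 2^(L (2^(L+1) + 1)) + 1): the length bound for a binary alphabet."""
    family_bound = 2 ** (L * (2 ** (L + 1) + 1))
    return L * (Fraction(eps) * family_bound + 1)
```

Even for $L = 2$ and $\epsilon = 1/2$, the bound already exceeds two hundred thousand characters.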

For bounding universality, that is $1/2^L$-almost universality, it is preferable to use
the bound provided by Stinson [31] to get the following result.



Lemma 10 At best, iterated hashing might be universal over the strings of length at
most $2L + L 2^{L+1}$. If the family is strongly universal, then it is limited to strings of at
most $L + 2\log (2^L)! - \log(2^L - 1) - 1$ characters.

PROOF. First, we consider universal hashing. By Equation 2, we have that
$2^{L(2^L|\Sigma|+1)} \geq |\mathcal{H}|$. Stinson [31] proved that the size of universal families must be at
least as large as the number of elements divided by the number of hash values, thus we
have $|\mathcal{H}| \geq |\Sigma|^n/2^L$. By combining these two inequalities, we get

$2^{L(2^L|\Sigma|+1)} \geq |\mathcal{H}| \geq |\Sigma|^n/2^L$

or

$2^{L(2^L|\Sigma|+1)} \geq |\Sigma|^n/2^L$.

Taking the logarithm on both sides, we get

$n \leq \frac{L(2^L|\Sigma|+2)}{\log|\Sigma|}$.

The right-hand-side of this last inequality grows with $|\Sigma|$, but a family universal over a
large alphabet must be universal over a smaller alphabet as well. Thus we set $|\Sigma| = 2$.
This proves the first part of the lemma.

Consider strongly universal hashing. Recall that Stinson [31] proved that strongly
universal families must have cardinality at least $1 + a(b - 1)$ where $a$ is the number of
strings and $b$ is the number of hash values. Hence, we have that $|\mathcal{H}| \geq 1 + |\Sigma|^n(2^L - 1)$.
As a consequence of Lemma 4, strongly universal hash families must have permuting
compression functions. There are $((2^L)!)^{|\Sigma|}$ such functions, and $2^L$ possible initial values
for a total of at most $2^L \times ((2^L)!)^{|\Sigma|}$ hash functions. Hence, we have $2^L \times ((2^L)!)^{|\Sigma|} \geq |\mathcal{H}|$. By
combining these inequalities, we get

$2^L \times ((2^L)!)^{|\Sigma|} \geq 1 + |\Sigma|^n(2^L - 1)$.

We can drop the constant term 1 from the right to get

$|\Sigma|^n < \frac{2^L \times ((2^L)!)^{|\Sigma|}}{2^L - 1}$.

As before, we can set $|\Sigma| = 2$ to get

$n < L + 2\log (2^L)! - \log(2^L - 1)$.

This concludes the proof. □
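As a rough numerical companion to Lemma 10, the following sketch evaluates both bounds for a given $L$, taking logarithms in base 2 (an assumption consistent with setting $|\Sigma| = 2$). The function names are illustrative.

```python
from math import factorial, log2

def universal_length_bound(L):
    """2L + L 2^(L+1): the length bound for universal iterated hashing."""
    return 2 * L + L * 2 ** (L + 1)

def strongly_universal_length_bound(L):
    """L + 2 log2((2^L)!) - log2(2^L - 1) - 1: the bound for strongly universal families."""
    size = 2 ** L
    return L + 2 * log2(factorial(size)) - log2(size - 1) - 1
```

Both values grow quickly with $L$ and can be compared with the much smaller figure of $2^L + 1$ characters discussed in the next section.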



7. Limitations of iterated hashing over long strings 



To characterize the limitations of iterated hashing, irrespective of the family size,
we want to compute a bound on the string length given a desired bound $\epsilon$ on the collision
probability.

Let $s_{r,a}$ be the unary string made of the character $a$ repeated $r$ times. For example,
we have $s_{3,a} = aaa$. Because we have at most $2^L$ distinct hash values, we have
that $h(s_{2^L+1,a})$ must be equal to $h(s_{r,a})$ for some $r \in \{1, \ldots, 2^L\}$. Hence, we have the
following lemma.

Lemma 11 For any iterated hash function $h$ and any character $a$, the values $h(s_{r,a})$ are
cyclic over $r \geq 1$ with a period $T \in \{1, 2, \ldots, 2^L\}$ except maybe for the first $2^L - T$ hash
values.

Proof. In the $2^L + 1$ hash values $h(s_{r,a})$ for $r \in \{1, 2, \ldots, 2^L, 2^L + 1\}$, one value must
be repeated because there are at most $2^L$ distinct hash values. Write $h(s_{r_1,a}) = h(s_{r_2,a})$,
then by Proposition 1, $h(s_{r_1+i,a}) = h(s_{r_2+i,a})$ for any non-negative integer $i$. Without loss
of generality, assume $r_2 > r_1$. This proves that $h(s_{r_1+x,a})$ is cyclic in $x$ with period at
most $T = |r_1 - r_2|$. We see that $1 \leq T \leq 2^L$. Only the $h(s_{i,a})$ for $i = 1, 2, \ldots, r_1 - 1$ are
excluded from our analysis. This concludes the proof. □
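The pigeon-hole argument can be exercised on random compression functions. In the sketch below, random_compression_function draws an arbitrary table (a stand-in for any iterated hash function, not a specific family from this paper), and first_repeat returns the pair $(r_1, T)$ used in the proof. All names are assumptions made for this example.

```python
import random

def random_compression_function(L, alphabet_size, rng):
    """Return an arbitrary F(y, c) stored as a random lookup table."""
    size = 1 << L
    table = [[rng.randrange(size) for _ in range(size)]
             for _ in range(alphabet_size)]
    return lambda y, c: table[c][y]

def unary_hash_values(F, H0, a, count):
    """Return [h(s_{1,a}), ..., h(s_{count,a})]."""
    values, y = [], H0
    for _ in range(count):
        y = F(y, a)
        values.append(y)
    return values

def first_repeat(F, H0, a, L):
    """Return (r1, T) with h(s_{r1,a}) == h(s_{r1+T,a}) and r1 + T <= 2^L + 1."""
    seen = {}
    y = H0
    for r in range(1, (1 << L) + 2):
        y = F(y, a)
        if y in seen:
            return seen[y], r - seen[y]
        seen[y] = r
    raise AssertionError("unreachable: 2^L + 1 values cannot all be distinct")
```

In every run, $T$ lies between 1 and $2^L$ and at most $2^L - T$ leading values fall outside the cycle, as the lemma states.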

Let $\mathrm{LCM}_k = \mathrm{LCM}(\{1, 2, \ldots, k\})$ be the least common multiple of the integers from
1 to $k$, inclusively. For example, we have $\mathrm{LCM}_2 = 2$, $\mathrm{LCM}_4 = 12$, $\mathrm{LCM}_8 = 840$. By
definition, for any $T \in \{1, 2, \ldots, k\}$, we have that $T$ divides $\mathrm{LCM}_k$. Thus, the strings
$s_{2^L,a}$ and $s_{2^L + \mathrm{LCM}_{2^L},a}$ collide with probability one under iterated hashing by Lemma 11.
Using a generalized argument, we have the following proposition.

Proposition 4 We have the following results concerning iterated hashing over
variable-length strings:

• Almost universality over strings of length up to $2^L + \mathrm{LCM}_{2^L}$ is impossible;

• Universality over strings of length up to $2^L + 2$ is impossible;

• For $1/2^L < \epsilon < 1$ such that $1/\epsilon$ is not an integer, $\epsilon$-almost universality over
strings of length at most $2^L + \mathrm{LCM}_{2^L + 1 - \lfloor 1/\epsilon \rfloor}$ is impossible.

Proof. Since the values $h(s_{r+2^L-1,a})$ must have period $T \in \{1, \ldots, 2^L\}$ as functions
of $r$, and $T$ must divide $\mathrm{LCM}_{2^L}$, we have that $h(s_{2^L,a}) = h(s_{2^L + \mathrm{LCM}_{2^L},a})$ for all iterated
hash functions $h$ (see Lemma 11). This proves the first result.

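We prove the last result. Suppose that hashing is $\epsilon$-almost universal. Then the probability
that $h(s_{r+2^L-1,a})$ is cyclic in $r$ with period $T$ is bounded by $\epsilon$: $P(\mathrm{period}(h) = T) \leq \epsilon$.
Thus, we have $P(\mathrm{period}(h) \in \{2^L - j + 1, 2^L - j + 2, \ldots, 2^L\}) \leq j\epsilon$ for any
integer $j$ between 1 and $2^L$. Because $P(\mathrm{period}(h) \in \{1, 2, \ldots, 2^L\}) = 1$, we have
$P(\mathrm{period}(h) \leq 2^L - j) \geq 1 - j\epsilon$. Setting $j = \lfloor 1/\epsilon \rfloor - 1$, we have that
$P(\mathrm{period}(h) \leq 2^L - j) \geq 1 - (\lfloor 1/\epsilon \rfloor - 1)\epsilon = 1 + \epsilon - \lfloor 1/\epsilon \rfloor \epsilon > \epsilon$, where the last
inequality holds because $1/\epsilon$ is not an integer. Thus, the probability
$P(h(s_{2^L,a}) = h(s_{2^L + \mathrm{LCM}_{2^L + 1 - \lfloor 1/\epsilon \rfloor},a})) > \epsilon$, which concludes the proof of the last item.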



The second result follows because universality implies $(1/2^L + \delta)$-almost universality
for all $\delta > 0$. We can find $\delta$ sufficiently small, such that $1/\epsilon$ is not an integer
for $\epsilon = 1/2^L + \delta$, and such that $\lfloor 1/\epsilon \rfloor = 2^L - 1$. Thus, we have that universality over
strings of length at most $2^L + \mathrm{LCM}_2 = 2^L + 2$ is impossible. This concludes the proof. □

Table 5: Comparison of the bounds on universality from Lemma 10 and Proposition 4
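The first and third items of Proposition 4 lend themselves to a quick numerical check. The sketch below (hypothetical names, arbitrary table-based compression functions) compares the hash values of $s_{2^L,a}$ and $s_{2^L + \mathrm{LCM}_{2^L},a}$, and evaluates the length $2^L + \mathrm{LCM}_{2^L + 1 - \lfloor 1/\epsilon \rfloor}$ for a given $\epsilon$.

```python
from fractions import Fraction
from math import floor, gcd
import random

def lcm_up_to(k):
    """Least common multiple of the integers 1, 2, ..., k."""
    result = 1
    for i in range(2, k + 1):
        result = result * i // gcd(result, i)
    return result

def unary_hash(F, H0, a, r):
    """Hash value of the character a repeated r times."""
    y = H0
    for _ in range(r):
        y = F(y, a)
    return y

def impossible_length(L, eps):
    """2^L + LCM_{2^L + 1 - floor(1/eps)}, assuming 1/2^L < eps < 1 and 1/eps not an integer."""
    size = 1 << L
    return size + lcm_up_to(size + 1 - floor(1 / Fraction(eps)))
```

For $L = 2$ and $\epsilon$ slightly above $1/4$, impossible_length returns $2^L + 2$, in agreement with the second item.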

We have that $\mathrm{LCM}_{2^L}$ divides $(2^L)!$ so $\mathrm{LCM}_{2^L} \leq (2^L)!$; moreover, by a standard inequality,
$(2^L)! \leq (2^L)^{2^L} = 2^{L 2^L}$. Hence, the bound on almost universality from this last proposition
is preferable to the cardinality-based bound (see Lemma 9). Similarly, we compare the
bounds on universality in Table 5: the new bound of $2^L + 1$ characters is much smaller.

At least for unary strings, the next lemma shows that the almost universality bound
of Proposition 4 is tight (up to one character).

Lemma 12 There exists an almost universal iterated family over unary strings of
length at most $2^L + \mathrm{LCM}_{2^L} - 2$.



Proof. For $T \in \{1, 2, \ldots, 2^L\}$, define

$h_T(s_{r,a}) = r$ if $0 \leq r < 2^L$, and $h_T(s_{r,a}) = 2^L - T + ((r - 2^L) \bmod T)$ otherwise.

Effectively, $h_T$ goes from 0 to $2^L - 1$ for strings of length 0 to $2^L - 1$, and then it
becomes cyclic with period $T$ (see Fig. 3). This family is iterated.

We want to show that there is no pair of distinct strings $s_{r,a}$, $s_{r',a}$ for $r, r' < \mathrm{LCM}_{2^L} + 2^L - 1$
such that $h_T(s_{r,a}) = h_T(s_{r',a})$ for all $T \in \{1, 2, \ldots, 2^L\}$. Suppose that it is false. It
cannot happen if $r, r' < 2^L$ since $h_T(s_{r,a}) = h_T(s_{r',a})$ would imply $r = r'$. Suppose that
$h_T(s_{r,a}) = h_T(s_{r',a})$ for all $T \in \{1, 2, \ldots, 2^L\}$, and for some $2^L \leq r \leq \mathrm{LCM}_{2^L} + 2^L - 2$ and
some $r' < 2^L$. Then $h_T(s_{r',a}) = r'$. This would imply that $h_T(s_{r,a})$ is independent
of $T$, which is not possible for $2^L \leq r < \mathrm{LCM}_{2^L} + 2^L - 1$ because $h_T(s_{r,a})$ is cyclic with
period $T$. Similarly, for $2^L \leq r, r' < \mathrm{LCM}_{2^L} + 2^L - 1$, the equality $h_T(s_{r,a}) = h_T(s_{r',a})$ is
possible only when $T \in \{1, 2, \ldots, 2^L\}$ divides $|r - r'|$. But since $|r - r'| < \mathrm{LCM}_{2^L}$, this
is not possible for all $T \leq 2^L$.

The result is shown. □
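The family used in this proof is small enough to test exhaustively. The sketch below assumes the reading $h_T(s_{r,a}) = r$ for $r < 2^L$ and $h_T(s_{r,a}) = 2^L - T + ((r - 2^L) \bmod T)$ otherwise; the sign in front of the modular term is an assumption, chosen because it makes the family iterated. Function names are hypothetical.

```python
def h(T, r, L):
    """Assumed definition of h_T on the unary string of length r."""
    size = 1 << L
    if r < size:
        return r
    return size - T + (r - size) % T

def step(T, y, L):
    """Compression function consistent with h: count up, then jump back by T."""
    size = 1 << L
    return y + 1 if y < size - 1 else size - T

def pairs_colliding_for_all_T(L, max_length):
    """Pairs of lengths (r1 < r2 <= max_length) that collide under every h_T."""
    size = 1 << L
    return [(r1, r2)
            for r1 in range(max_length + 1)
            for r2 in range(r1 + 1, max_length + 1)
            if all(h(T, r1, L) == h(T, r2, L) for T in range(1, size + 1))]
```

For $L = 2$ (so that $\mathrm{LCM}_4 = 12$), no pair of lengths up to 14 collides under every $h_T$, while allowing length 15 introduces exactly one such pair.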



Figure 3: Functions $r \to h_T(s_{r,a})$ for $L = 2$ and $T = 1, 2, 3, 4$

Finally, we show that the universality bound of Proposition 4 is tight for unary
strings (up to two characters). To prove the result, we build a perfect (collision-free)
hash function over non-empty unary strings of length at most $2^L$. Consider strings made of the
character $a$. Let $\pi$ be a cycle of length $2^L$ over the integers $\{1, 2, \ldots, 2^L\}$. For example,
let $\pi$ be the permutation that takes $k$ to $k + 1$ for $k < 2^L$ and $2^L$ to 1. We choose the
hash function $h(s_{r,a}) = \pi^r 1$, that is, we set $H_0 = 1$. A collision between any two such strings
implies that $\pi^l H_0 = H_0$ for some $0 < l < 2^L$, which is impossible because $\pi$ is a cycle of
length $2^L$. Thus, no two non-empty unary strings of length at most $2^L$ may collide under this hash
function.
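A minimal sketch of this perfect hash function, using the example permutation that maps $k$ to $k + 1$ and $2^L$ back to 1 (the function name is an assumption):

```python
def cycle_hash(r, L):
    """Apply the cycle k -> k + 1 (and 2^L -> 1) r times to the initial value 1."""
    size = 1 << L
    y = 1
    for _ in range(r):
        y = y + 1 if y < size else 1
    return y
```

The $2^L$ non-empty unary strings of length at most $2^L$ receive distinct values; the string of length $2^L$ maps back to the initial value, so the empty string has to be excluded.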

8. Conclusion 

We have shown that iterated hashing can be pairwise independent over short strings. 
Moreover, iterated hashing can be almost universal over strings longer than the number 
of hash values. Motivated by this result, we have derived bounds on the universality 
of iterated hashing. We can construct large iterated hashing families: this might suggest
that a very high degree of universality is possible. Alas, we have shown that this
expectation would be misguided.

We have identified two open problems which we find interesting. On the one hand, 
we lack a bound on the universality of iterated hashing given the size of the family. 
Our bounds assume arbitrarily large families. On the other hand, we are still missing 
provably optimal iterated families. For example, is it possible to construct a family
which is pairwise independent for strings longer than $L$ for some values of $L > 1$? Future
work might consider the specific limitations of other hashing strategies ll29l[3D . 
A. Other popular iterated hash functions

For completeness, we review some popular iterated hash families and functions. 
We show that many of these families have permuting compression functions. 

A.1. Hashing by random irreducible polynomials

Instead of using a fixed irreducible polynomial, as in CWPOLY, we can pick the
irreducible polynomial at random [10] [30] (henceforth DIVISION). Considering $L$-bit
characters as polynomials having binary coefficients ($GF(2)[x]$), we use the compression
function $F(y,c) = y x^L + c$. As in CWPOLY, we specify an initial value of 1. Thus,
given a string $s$, the hash value is $x^{nL} + s_1 x^{(n-1)L} + \cdots + s_n \bmod p(x)$. Consider two
strings of length at most $n$. The equation $h(s) = h(s')$ is true in $GF(2)[x]/p(x)$ only if
the non-zero polynomial of degree at most $nL$ formed by $h(s) - h(s')$ is divisible by
$p(x)$. The polynomials of degree $nL$ have at most $n$ irreducible factors of degree $L$.
Meanwhile, there are at least $(2^L - 2^{L/2+1})/L$ irreducible polynomials $p(x)$ of degree
$L$ [22]. Thus, the probability of a collision is no larger than $\frac{nL}{2^L - 2^{L/2+1}}$.

We can extend this analysis to show almost XOR universality with the same bound
($\frac{nL}{2^L - 2^{L/2+1}}$). Pick any value $y$, then the probability $P(h(s) \oplus h(s') = y)$ is given by the
probability that $h(s) + h(s') - y = 0$ in $GF(2)[x]/p(x)$. The polynomial $h(s) + h(s') - y$
in $GF(2)[x]$ has degree at most $nL$ but at least $L$, and the result follows.

The compression function of DIVISION is (strongly) permuting: $F(y,c) = F(y',c)$
implies $x^L y = x^L y' \bmod p(x)$ which implies $y = y'$. Moreover, if we forbid the character
value zero at the beginning of strings, and pick the initial value randomly, we have that
DIVISION is uniform and thus $\frac{nL}{2^L - 2^{L/2+1}}$-almost strongly universal.

We might be able to compute DIVISION faster than CWPOLY. However, selecting
a random irreducible polynomial might be slow. To cope with this problem, Shoup
introduced a generalized DIVISION [30] with compression function $F(y,c) = y x^{L/k} + c$
and $p(x)$ chosen as a monic irreducible polynomial of degree $L/k$.

A.2. Bernstein hashing 

Bernstein proposed a computationally efficient compression function [2]: $F(y,c) =
((y \ll l) + y) \oplus c$ where $y \ll l$ is the left shift by $l$ bits. For all $l > 0$, this compression
function is strongly permuting. Hence, given randomly chosen initial values, we have
uniform hashing.
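The strongly permuting property can be confirmed by brute force for a small word size. The sketch below assumes $L$-bit arithmetic (results truncated to $L$ bits) and $L$-bit characters; the function names and the default values l = 5 and L = 32 are assumptions made for this example.

```python
def bernstein(y, c, l=5, L=32):
    """F(y, c) = ((y << l) + y) XOR c, with the sum truncated to L bits."""
    mask = (1 << L) - 1
    return ((((y << l) & mask) + y) & mask) ^ c

def is_strongly_permuting(F, L):
    """Check that F is injective in each argument when the other one is fixed."""
    size = 1 << L
    for fixed in range(size):
        if len({F(y, fixed) for y in range(size)}) != size:
            return False
        if len({F(fixed, c) for c in range(size)}) != size:
            return False
    return True
```

With $L = 8$, the check succeeds for every shift $l$ from 1 to 7 and fails for $l = 0$, where the multiplier $2^l + 1$ becomes even.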



A.3. Fowler-Noll-Vo hashing

There are two types of Fowler-Noll-Vo hash functions [19]. The FNV-1 compression
functions take the form $F(y,c) = (yp) \oplus c$ where $p$ is a prime number. It is a
generalization of Bernstein hashing. The FNV-1a compression functions are of the
form $F(y,c) = (y \oplus c)p$ for some prime $p$. Both FNV-1 and FNV-1a are strongly
permuting.
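For concreteness, here is a sketch of both variants over 32-bit words. The constants are the commonly published 32-bit FNV parameters (offset basis 2166136261 and prime 16777619); they are not given in the text above and are used here as an assumption about the usual instantiation.

```python
FNV_OFFSET_32 = 2166136261  # usual 32-bit initial value (assumed instantiation)
FNV_PRIME_32 = 16777619     # usual 32-bit prime p (assumed instantiation)
MASK_32 = (1 << 32) - 1

def fnv1(data, y=FNV_OFFSET_32, p=FNV_PRIME_32):
    """FNV-1: F(y, c) = (y * p) XOR c, over the bytes of data."""
    for c in data:
        y = ((y * p) & MASK_32) ^ c
    return y

def fnv1a(data, y=FNV_OFFSET_32, p=FNV_PRIME_32):
    """FNV-1a: F(y, c) = (y XOR c) * p, over the bytes of data."""
    for c in data:
        y = ((y ^ c) * p) & MASK_32
    return y
```

The two functions differ only in the order of the multiplication and the XOR.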

A.4. SAX and SXX

The shift-add-xor (SAX) scheme [25] is defined by the compression function
$F(y,c) = y \oplus ((y \ll l) + (y \gg r) + c)$ where $y \gg r$ is the right shift by $r$ bits. For $L = 32$
and 7-bit characters, Ramakrishna and Zobel [25] reported that SAX is empirically
universal for $4 \leq l \leq 7$ and $1 \leq r \leq 3$, and a randomly chosen 32-bit initial value. They
found that the alternative, shift-xor-xor (SXX), $F(y,c) = y \oplus ((y \ll l) \oplus (y \gg r) \oplus c)$,
is not competitive.

A.5. String hashing functions in common programming languages 

Strings are commonly used as keys in hash tables. Thus, most programming lan- 
guages include string hashing functions. We consider C++ and Java. 

ISO added support for hash tables to the C++ language (unordered_map) [35].
Implementations of the language are required to provide a string hashing function,
but the exact function is unspecified. However, a popular compiler (GNU GCC, version
4.1.1) implemented it as an iterated hash function with the compression function
$F(y,c) = 5y + c \bmod 2^{32}$ and an initial value of zero. For example, the hash value of
the one-character string z is 122, the decimal value corresponding to the character z.

The Java String class has a specified hashCode method. As of version 1.3 of the
language, it is an iterated hash function with compression function $F(y,c) = 31y + c$
(using int arithmetic) and an initial value of zero. Because Java lacks unsigned
integers as a native type, the hash value of a sufficiently long string (e.g., zzzzzz)
can be a negative integer. (Java uses the two's complement binary representation, so
that signed integers are interchangeable with unsigned integers as long as we only use
addition, subtraction and multiplication.)
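Both language examples follow the same pattern and can be emulated in a few lines. The sketch below reproduces the arithmetic in Python for ASCII strings; the function names are hypothetical, and the conversion to a signed value mimics two's complement int arithmetic.

```python
def multiplicative_hash(text, B, bits=32, H0=0):
    """Iterate F(y, c) = B * y + c modulo 2^bits over the characters of text."""
    mask = (1 << bits) - 1
    y = H0
    for ch in text:
        y = (B * y + ord(ch)) & mask
    return y

def to_signed(value, bits=32):
    """Reinterpret an unsigned value as a two's complement signed integer."""
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

def java_hash_code(text):
    """Emulate String.hashCode (B = 31, signed 32-bit result) for ASCII text."""
    return to_signed(multiplicative_hash(text, 31))
```

With B = 5, the one-character string z hashes to 122; with B = 31, the string zzzzzz yields a negative value whereas zzzzz does not.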

A.6. PowerOfTwo hashing 

In light of the hash functions used in Java and C++, consider the family given
by the compression function $F(y,c) = By + c \bmod 2^L$ (henceforth PowerOfTwo).
If the initial value is zero, then $h(00) = h(0) = 0$, but we can fix this problem by
using a non-zero initial value. However, suppose that $B$ is even and consider any two
strings $s$ and $s'$ of length greater than $L$ and differing only in the first character, then
$h(s) = h(s')$ because $B^L \bmod 2^L = 0$. Thus, unsurprisingly, both the C++ and Java
implementations set $B$ to an odd integer.

Suppose that B is odd. The compression function is then strongly permuting. Thus, 
by choosing the initial value at random, we have uniform hashing. 

When $B$ is odd, we have that $B + 1$ is even, so that $(B + 1)^L \bmod 2^L = 0$. By
the binomial theorem, we have that $0 = (B + 1)^L \bmod 2^L = \sum_{k=0}^{L} B^k \left(\binom{L}{k} \bmod 2^L\right) \bmod 2^L$.
Thus, irrespective of the initial value, the two strings of length $L + 1$ given
by the characters $\binom{L}{k} \bmod 2^L$ for $k = 0, 1, \ldots, L$ and $00 \cdots 0$ collide when $B$ is odd.
By this construction, PowerOfTwo cannot be almost universal unless we limit the
length of the strings to at most $L$ characters.
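The collision described above can be reproduced directly. The sketch below builds the string of binomial coefficients and the all-zero string of the same length, then hashes both with $F(y,c) = By + c \bmod 2^L$; the function names are assumptions.

```python
from math import comb

def power_of_two_hash(string, B, L, H0):
    """Iterate F(y, c) = B * y + c modulo 2^L over integer characters."""
    mask = (1 << L) - 1
    y = H0
    for c in string:
        y = (B * y + c) & mask
    return y

def colliding_pair(L):
    """Return the binomial string and the all-zero string, both of length L + 1."""
    mask = (1 << L) - 1
    binomial = [comb(L, k) & mask for k in range(L + 1)]
    zeros = [0] * (L + 1)
    return binomial, zeros
```

For every odd $B$ and every initial value, the two strings returned by colliding_pair hash to the same value, since their hash values differ by $(B + 1)^L \bmod 2^L = 0$.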



